{
    "_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
    "text" : "First week of school is over :P",
    "in_reply_to_status_id" : null,
    "retweet_count" : null,
    "contributors" : null,
    "created_at" : "Thu Sep 02 18:11:25 +0000 2010",
    "geo" : null,
    "source" : "web",
    "coordinates" : null,
    "in_reply_to_screen_name" : null,
    "truncated" : false,
    "entities" : {
        "user_mentions" : [ ],
        "urls" : [ ],
        "hashtags" : [ ]
    },
    "retweeted" : false,
    "place" : null,
    "user" : {
        "friends_count" : 145,
        "profile_sidebar_fill_color" : "E5507E",
        "location" : "Ireland :)",
        "verified" : false,
        "follow_request_sent" : null,
        "favourites_count" : 1,
        "profile_sidebar_border_color" : "CC3366",
        "profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
        "geo_enabled" : false,
        "created_at" : "Sun May 03 19:51:04 +0000 2009",
        "description" : "",
        "time_zone" : null,
        "url" : null,
        "screen_name" : "Catherinemull",
        "notifications" : null,
        "profile_background_color" : "FF6699",
        "listed_count" : 77,
        "lang" : "en",
        "profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
        "statuses_count" : 2475,
        "following" : null,
        "profile_text_color" : "362720",
        "protected" : false,
        "show_all_inline_media" : false,
        "profile_background_tile" : true,
        "name" : "Catherine Mullane",
        "contributors_enabled" : false,
        "profile_link_color" : "B40B43",
        "followers_count" : 169,
        "id" : 37486277,
        "profile_use_background_image" : true,
        "utc_offset" : null
    },
    "favorited" : false,
    "in_reply_to_user_id" : null,
    "id" : NumberLong("22819398300")
}
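Above is an example tweet document from the twitter.tweets collection. The exercise below reaches into the nested user subdocument (user.time_zone, user.statuses_count, user.followers_count) using dot notation. As a minimal, hedged sketch of what dot notation looks like in an ordinary find query, assuming the same dataset imported into a local MongoDB instance:

# Sketch only: assumes a local mongod with the twitter.tweets collection shown above.
from pymongo import MongoClient

db = MongoClient('localhost:27017')['twitter']

# Dot notation reaches fields inside the nested "user" subdocument.
tweet = db.tweets.find_one({'user.time_zone': 'Brasilia',
                            'user.statuses_count': {'$gte': 100}})
if tweet is not None:
    print tweet['user']['screen_name'], tweet['user']['followers_count']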
In [8]:
%%writefile using_match_project.py
#!/usr/bin/env python
"""
Write an aggregation query to answer this question:
Of the users in the "Brasilia" timezone who have tweeted 100 times or more,
who has the largest number of followers?
The following hints will help you solve this problem:
- Time zone is found in the "time_zone" field of the user object in each tweet.
- The number of tweets for each user is found in the "statuses_count" field.
To access these fields you will need to use dot notation (from Lesson 4).
- Your aggregation query should return something like the following:
{u'ok': 1.0,
u'result': [{u'_id': ObjectId('52fd2490bac3fa1975477702'),
u'followers': 2597,
u'screen_name': u'marbles',
u'tweets': 12334}]}
Please modify only the 'make_pipeline' function so that it creates and returns an aggregation
pipeline that can be passed to the MongoDB aggregate function. As in our examples in this lesson,
the aggregation pipeline should be a list of one or more dictionary objects.
Please review the lesson examples if you are unsure of the syntax.
Your code will be run against a MongoDB instance that we have provided. If you want to run this code
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
Please note that the dataset you are using here is a smaller version of the twitter dataset used
in examples in this lesson. If you attempt some of the same queries that we looked at in the lesson
examples, your results will be different.
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db


def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {'$match': {'user.time_zone': 'Brasilia',
                    'user.statuses_count': {'$gte': 100}}},
        {'$project': {'followers': '$user.followers_count',
                      'screen_name': '$user.screen_name',
                      'tweets': '$user.statuses_count'}},
        {'$sort': {'followers': -1}},
        {'$limit': 1}
    ]
    return pipeline


def aggregate(db, pipeline):
    result = db.tweets.aggregate(pipeline)
    return result


if __name__ == '__main__':
    db = get_db('twitter')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    import pprint
    pprint.pprint(result)
    assert len(result["result"]) == 1
    assert result["result"][0]["followers"] == 17209
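Note how the $project stage both renames and promotes the nested user fields: documents leaving that stage already have the flat {_id, followers, screen_name, tweets} shape shown in the docstring's sample result, which is what the following $sort stage orders on.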
Here is an example document from the cities collection in MongoDB:
{
    "_id" : ObjectId("52fe1d364b5ab856eea75ebc"),
    "elevation" : 1855,
    "name" : "Kud",
    "country" : "India",
    "lon" : 75.28,
    "lat" : 33.08,
    "isPartOf" : [
        "Jammu and Kashmir",
        "Udhampur district"
    ],
    "timeZone" : [
        "Indian Standard Time"
    ],
    "population" : 1140
}
In [9]:
%%writefile using_unwind.py
#!/usr/bin/env python
"""
For this exercise, let's return to our cities infobox dataset. The question we would like you to answer
is as follows: Which region in India contains the most cities?
As a starting point, use the solution for the example question we looked at -- "Who includes the most
user mentions in their tweets?"
One thing to note about the cities data is that the "isPartOf" field contains an array of regions or
districts in which a given city is found. See the example city document shown above this cell.
Please modify only the 'make_pipeline' function so that it creates and returns an aggregation pipeline
that can be passed to the MongoDB aggregate function. As in our examples in this lesson, the aggregation
pipeline should be a list of one or more dictionary objects. Please review the lesson examples if you
are unsure of the syntax.
Your code will be run against a MongoDB instance that we have provided. If you want to run this code
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
Please note that the dataset you are using here is a smaller version of the cities collection used in
examples in this lesson. If you attempt some of the same queries that we looked at in the lesson
examples, your results may be different.
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db


def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {'$match': {'country': 'India'}},
        {'$unwind': '$isPartOf'},
        {'$group': {'_id': '$isPartOf',
                    'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
        {'$limit': 1}
    ]
    return pipeline


def aggregate(db, pipeline):
    result = db.cities.aggregate(pipeline)
    return result


if __name__ == '__main__':
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    print "Printing the first result:"
    import pprint
    pprint.pprint(result["result"][0])
    assert result["result"][0]["_id"] == "Uttar Pradesh"
    assert result["result"][0]["count"] == 623
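Because isPartOf is an array, the $unwind stage is what makes the per-region counting possible: it emits one output document per array element, copying the other fields. A small hedged sketch, run against the "Kud" example document shown earlier (pymongo 2.x style, matching the rest of this notebook):

# Sketch only: shows how $unwind expands the "isPartOf" array of a single city.
import pprint
from pymongo import MongoClient

db = MongoClient('localhost:27017')['examples']
result = db.cities.aggregate([
    {'$match': {'name': 'Kud'}},
    {'$unwind': '$isPartOf'},
    {'$project': {'_id': 0, 'name': 1, 'isPartOf': 1}}
])
# With pymongo 2.x the documents are under result["result"]; expect one document
# per array element, e.g. isPartOf "Jammu and Kashmir" and "Udhampur district".
pprint.pprint(result["result"])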
In [11]:
%%writefile using_push.py
#!/usr/bin/env python
"""
$push is similar to $addToSet. The difference is that rather than accumulating only unique values,
it aggregates all values into an array.
Using an aggregation query, count the number of tweets for each user. In the same $group stage,
use $push to accumulate all the tweet texts for each user. Limit your output to the 5 users
with the most tweets.
Your result documents should include only the fields:
"_id" (screen name of user),
"count" (number of tweets found for the user),
"tweet_texts" (a list of the tweet texts found for the user).
Please modify only the 'make_pipeline' function so that it creates and returns an aggregation
pipeline that can be passed to the MongoDB aggregate function. As in our examples in this lesson,
the aggregation pipeline should be a list of one or more dictionary objects.
Please review the lesson examples if you are unsure of the syntax.
Your code will be run against a MongoDB instance that we have provided. If you want to run this code
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
Please note that the dataset you are using here is a smaller version of the twitter dataset used in
examples in this lesson. If you attempt some of the same queries that we looked at in the lesson
examples, your results will be different.
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db


def make_pipeline():
    # complete the aggregation pipeline
    # Note: field references in $group need a leading "$"; without it the literal
    # strings "user.screen_name" and "text" would be accumulated instead.
    pipeline = [
        {'$group': {'_id': '$user.screen_name',
                    'count': {'$sum': 1},
                    'tweet_texts': {'$push': '$text'}}},
        {'$sort': {'count': -1}},
        {'$limit': 5}
    ]
    return pipeline


def aggregate(db, pipeline):
    result = db.tweets.aggregate(pipeline)
    return result


if __name__ == '__main__':
    db = get_db('twitter')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    assert len(result["result"]) == 5
    assert result["result"][0]["count"] > result["result"][4]["count"]
    import pprint
    pprint.pprint(result)
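To see the $push / $addToSet difference described in the docstring, here is a hedged side-by-side sketch on the same group key (again assuming the local twitter database and the pymongo 2.x aggregate API used throughout this notebook):

# Sketch only: $push keeps every value (duplicates included), $addToSet keeps
# each distinct value once.
import pprint
from pymongo import MongoClient

db = MongoClient('localhost:27017')['twitter']
pipeline = [
    {'$group': {'_id': '$user.screen_name',
                'all_sources': {'$push': '$source'},
                'unique_sources': {'$addToSet': '$source'}}},
    {'$limit': 3}
]
pprint.pprint(db.tweets.aggregate(pipeline)["result"])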
In [14]:
%%writefile same_operator.py
#!/usr/bin/env python
"""
In an earlier exercise we looked at the cities dataset and asked which region in India contains
the most cities. In this exercise, we'd like you to answer a related question regarding regions in
India. What is the average city population for a region in India? Calculate your answer by first
finding the average population of cities in each region and then by calculating the average of the
regional averages.
Hint: If you want to accumulate using values from all input documents to a group stage, you may use
a constant as the value of the "_id" field. For example,
{ "$group" : {"_id" : "India Regional City Population Average",
... }
Please modify only the 'make_pipeline' function so that it creates and returns an aggregation
pipeline that can be passed to the MongoDB aggregate function. As in our examples in this lesson,
the aggregation pipeline should be a list of one or more dictionary objects.
Please review the lesson examples if you are unsure of the syntax.
Your code will be run against a MongoDB instance that we have provided. If you want to run this code
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
Please note that the dataset you are using here is a smaller version of the cities collection used
in examples in this lesson. If you attempt some of the same queries that we looked at in the lesson
examples, your results may be different.
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db


def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {'$match': {'country': 'India'}},
        {'$unwind': '$isPartOf'},
        {'$group': {'_id': '$isPartOf',
                    'avg': {'$avg': '$population'}}},
        {'$group': {'_id': 'India Regional City Population Average',
                    'avg': {'$avg': '$avg'}}}
    ]
    return pipeline


def aggregate(db, pipeline):
    result = db.cities.aggregate(pipeline)
    return result


if __name__ == '__main__':
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    assert len(result["result"]) == 1
    assert result["result"][0]["avg"] == 196025.97814809752
    import pprint
    pprint.pprint(result)
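One subtlety worth spelling out: the average of the regional averages is not the same as the average over all Indian cities. With made-up numbers, if region A has cities of 100 and 300 (regional average 200) and region B has a single city of 1000 (regional average 1000), the average of the regional averages is (200 + 1000) / 2 = 600, while the plain average over all three cities is 1400 / 3 ≈ 466.7. The two $group stages above compute the former.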
In [15]:
%%writefile most_common_city_name.py
#!/usr/bin/env python
"""
Use an aggregation query to answer the following question.
What is the most common city name in our cities collection?
Your first attempt probably identified None as the most frequently occurring city name.
What that actually means is that there are a number of cities without a name field at all.
It's strange that such documents would exist in this collection and, depending on your situation,
might actually warrant further cleaning.
To solve this problem the right way, we should really ignore cities that don't have a name specified.
As a hint, ask yourself: which pipeline operator allows us to simply filter input?
How do we test for the existence of a field?
Please modify only the 'make_pipeline' function so that it creates and returns an aggregation pipeline
that can be passed to the MongoDB aggregate function. As in our examples in this lesson,
the aggregation pipeline should be a list of one or more dictionary objects.
Please review the lesson examples if you are unsure of the syntax.
Your code will be run against a MongoDB instance that we have provided.
If you want to run this code locally on your machine, you have to install MongoDB,
download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
Please note that the dataset you are using here is a smaller version of the cities collection used in
examples in this lesson. If you attempt some of the same queries that we looked at in the lesson
examples, your results may be different.
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db


def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {'$match': {'name': {'$exists': 1}}},
        {'$group': {'_id': '$name',
                    'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
        {'$limit': 1}
    ]
    return pipeline


def aggregate(db, pipeline):
    result = db.cities.aggregate(pipeline)
    return result


if __name__ == '__main__':
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    import pprint
    pprint.pprint(result["result"][0])
    assert len(result["result"]) == 1
    assert result["result"][0] == {'_id': 'Shahpur', 'count': 6}
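For reference, $match accepts the same query document syntax as find, so the existence test here is the ordinary $exists operator; the equivalent plain query would be db.cities.find({'name': {'$exists': 1}}).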
In [17]:
%%writefile region_cities.py
#!/usr/bin/env python
"""
Use an aggregation query to answer the following question.
Which Region in India has the largest number of cities with longitude between 75 and 80?
Please modify only the 'make_pipeline' function so that it creates and returns an aggregation
pipeline that can be passed to the MongoDB aggregate function. As in our examples in this lesson,
the aggregation pipeline should be a list of one or more dictionary objects.
Please review the lesson examples if you are unsure of the syntax.
Your code will be run against a MongoDB instance that we have provided. If you want to run this
code locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
Please note that the dataset you are using here is a smaller version of the cities collection used in
examples in this lesson. If you attempt some of the same queries that we looked at in the lesson
examples, your results may be different.
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db


def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {'$match': {'country': 'India',
                    'lon': {'$gte': 75, '$lt': 80}}},
        {'$unwind': '$isPartOf'},
        {'$group': {'_id': '$isPartOf',
                    'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
        {'$limit': 1}
    ]
    return pipeline


def aggregate(db, pipeline):
    result = db.cities.aggregate(pipeline)
    return result


if __name__ == '__main__':
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    import pprint
    pprint.pprint(result["result"][0])
    assert len(result["result"]) == 1
    assert result["result"][0]["_id"] == 'Tamil Nadu'
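Note that the {'$gte': 75, '$lt': 80} condition treats the longitude range as inclusive at 75 and exclusive at 80; if "between 75 and 80" is meant to include 80 itself, '$lte' would be the operator to use.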
In [19]:
%%writefile average_population.py
#!/usr/bin/env python
"""
Use an aggregation query to answer the following question.
Extrapolating from an earlier exercise in this lesson, find the average regional city population
for all countries in the cities collection. What we are asking here is that you first calculate the
average city population for each region in a country and then calculate the average of all the
regional averages for a country. As a hint, _id fields in group stages need not be single values.
They can also be compound keys (documents composed of multiple fields). You will use the same
aggregation operator in more than one stage in writing this aggregation query. I encourage you to
write it one stage at a time and test after writing each stage.
Please modify only the 'make_pipeline' function so that it creates and returns an aggregation
pipeline that can be passed to the MongoDB aggregate function. As in our examples in this lesson,
the aggregation pipeline should be a list of one or more dictionary objects.
Please review the lesson examples if you are unsure of the syntax.
Your code will be run against a MongoDB instance that we have provided. If you want to run this code
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
Please note that the dataset you are using here is a smaller version of the cities collection used in
examples in this lesson. If you attempt some of the same queries that we looked at in the lesson
examples, your results may be different.
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db


def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {'$match': {'country': {'$exists': 1}}},
        {'$unwind': '$isPartOf'},
        {'$group': {'_id': {'country': '$country',
                            'regions': '$isPartOf'},
                    'avg': {'$avg': '$population'}}},
        {'$group': {'_id': '$_id.country',
                    'avgRegionalPopulation': {'$avg': '$avg'}}}
    ]
    return pipeline


def aggregate(db, pipeline):
    result = db.cities.aggregate(pipeline)
    return result


if __name__ == '__main__':
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    import pprint
    if len(result["result"]) < 150:
        pprint.pprint(result["result"])
    else:
        pprint.pprint(result["result"][:100])
    for country in result["result"]:
        if country["_id"] == 'Algeria':
            assert country["avgRegionalPopulation"] == 187590.19047619047
    assert {'_id': 'Algeria',
            'avgRegionalPopulation': 187590.19047619047} in result["result"]
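The shape of the intermediate documents is the part that trips people up: after the first $group, each document has a compound _id, roughly {'_id': {'country': ..., 'regions': ...}, 'avg': ...} (field names taken from the pipeline above, values omitted), and the second $group reaches back into that compound key with the dotted path '$_id.country'.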
In [ ]: